Introduction to Open Data Science - Course Project

Chapter 1: Getting Started


date()
## [1] "Mon Dec  4 16:59:10 2023"

Hello! It's 6:11 on Thursday, November 2nd. I expect this course to be a refresher on topics I am already familiar with. I found this course on Sisu. I'm looking forward to working with new data, using new visualization techniques and some multivariate techniques. I'm also interested in dealing with missing data.

This is an Introduction to Open Data Science course, where we visualize and analyze open data with multivariate statistical methods and follow practices that encourage open, reproducible science. Link to my GitHub: https://github.com/kris-nader/IODS-project


Chapter 2: Linear regression

Summary of the week (completed work and learning)

Chapter 2 consists of 2 parts:
-Data Wrangling. In this section, we use survey data and produce an analysis CSV file using an R script (./data/create_learning2014.R).

-Data Analysis. In this section, we use the learning2014.csv file to do some exploratory analysis. This includes plots, linear regression, and diagnostic plots from our final model.

About the data: This data is from a survey done to study the relationship between learning approaches and student achievement in an introductory statistics course in Finland, which took place from 03-12-2014 to 10-01-2015.

Part 1: data wrangling

Before processing/data wrangling, the data is 183x60: 56 Likert-scale variables on a scale of 1-5, 3 continuous variables (age, attitude, points), and 1 factor variable, gender (male/female).

We then created an analysis dataset for this chapter by running the ./data/create_learning2014.R script, which writes the learning2014.csv dataset to the same data directory. In this script, we create new variables based on the metadata hosted here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt and filter out datapoints whose Points variable equals 0.

After filtering, we have 7 variables and 166 observations:

  • Gender: Male or Female.
  • Age: In years, ranging from 17 to 55.
  • Attitude: Global attitude toward statistics, ~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj.
  • deep: Deep approach, calculated by averaging the 1-5 Likert scores of 3 techniques (Seeking meaning: d_sm, Relating ideas: d_ri, and Use of evidence: d_ue).
  • surf: Surface approach, calculated similarly to deep, based on 3 columns (Lack of purpose: su_lp, Unrelated memorizing: su_um, and Syllabus-boundness: su_sb).
  • stra: Strategic approach, calculated similarly to deep (Organized studying: st_os and Time management: st_tm).
  • points: Total points on an exam.
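The averaging-and-filtering steps above can be sketched in base R on a toy data frame. The item names (d_sm, d_ri, d_ue) follow the metadata grouping; this is illustrative only, and the actual script is ./data/create_learning2014.R.

```r
# Toy version of the wrangling: average Likert items into a scale score,
# then drop observations with zero exam points.
toy <- data.frame(
  d_sm   = c(4, 3, 5),   # Seeking meaning
  d_ri   = c(4, 2, 5),   # Relating ideas
  d_ue   = c(3, 3, 5),   # Use of evidence
  Points = c(25, 0, 30)
)
toy$deep <- rowMeans(toy[, c("d_sm", "d_ri", "d_ue")])  # deep approach score
analysis <- subset(toy, Points > 0)                     # drop Points == 0 rows
analysis
```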

In addition, a pre-made version of the dataset is available at https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt.

We can use the following command to create the dataset, assuming that the original data is in the data directory.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
#source("./data/create_learning2014.R")

After some data wrangling, we have 7 variables: Age, attitude, Points, gender, deep, stra, and surf. The gender variable is a character (M/F) and the others are doubles.

data=data.frame(read_csv('./data/learning2014.csv'))
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Age, attitude, Points, deep, stra, surf
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : num  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ Points  : num  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
head(data)
##   Age attitude Points gender     deep  stra     surf
## 1  53      3.7     25      F 3.583333 3.375 2.583333
## 2  55      3.1     12      M 2.916667 2.750 3.166667
## 3  49      2.5     24      F 3.500000 3.625 2.250000
## 4  53      3.5     10      M 3.500000 3.125 2.250000
## 5  49      3.7     22      M 3.666667 3.625 2.833333
## 6  38      3.8     21      F 4.750000 3.625 2.416667
summary(data)  
##       Age           attitude         Points         gender         
##  Min.   :17.00   Min.   :1.400   Min.   : 7.00   Length:166        
##  1st Qu.:21.00   1st Qu.:2.600   1st Qu.:19.00   Class :character  
##  Median :22.00   Median :3.200   Median :23.00   Mode  :character  
##  Mean   :25.51   Mean   :3.143   Mean   :22.72                     
##  3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:27.75                     
##  Max.   :55.00   Max.   :5.000   Max.   :33.00                     
##       deep            stra            surf      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.667   Median :3.188   Median :2.833  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333

We can do some exploratory analysis: attitude and Points seem to have a positive linear relationship. The other variables' relationships with Points are less clear, not as obvious as the relationship with attitude.

pairs(data[,-4])

There is also an imbalance in the number of males and females in the survey.

barplot(table(data$gender), main="Frequency of F and M in the survey",xlab="gender", ylab="Frequency",col="lightpink")

There is also a wide range of ages in this survey, from 17 to 55, with 21 being the most frequent.

barplot(table(data$Age), main="age in the survey",xlab="age", ylab="Frequency",col="lightgreen")

There is a good positive linear relationship between attitude and points, and weaker relationships between the other variables and points (some negative; red circles).

M = cor(data[,-4])
corrplot::corrplot(M)

Now, we can build a linear regression model with some interesting variables. I have chosen attitude, stra and surf based on the correlation plot above. Based on the summary, only attitude seems to be an important predictor: its t statistic of 5.91 (the estimate 3.3952 divided by the standard error 0.5741) gives a very small p value, while the others are not significant. For example, surf has a p value of 0.46, which is clearly insignificant, so we will remove it and refit the model. In this lm1 model, the adjusted R2 is 0.193, which means these variables explain only about 19% of the variability in points; not very good.

The coefficient of attitude can be interpreted as: an increase of 1 unit in attitude results in an increase of 3.39 exam points if the other variables are held constant.

lm1=lm(Points ~ stra + attitude + surf, data = data)
summary(lm1)
## 
## Call:
## lm(formula = Points ~ stra + attitude + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## stra          0.8531     0.5416   1.575  0.11716    
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
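As a sanity check on that interpretation, we can rebuild the prediction by hand from the lm1 coefficient table above: changing attitude by one unit while holding stra and surf fixed shifts the prediction by exactly the attitude coefficient. (The inputs 3, 4, 3 are arbitrary illustrative values.)

```r
# Manual prediction from the lm1 coefficients (values copied from the summary).
b <- c(intercept = 11.0171, stra = 0.8531, attitude = 3.3952, surf = -0.5861)
pred <- function(stra, attitude, surf) {
  unname(b["intercept"] + b["stra"] * stra + b["attitude"] * attitude + b["surf"] * surf)
}
# Raising attitude from 3 to 4 with stra and surf fixed:
pred(3, 4, 3) - pred(3, 3, 3)  # equals the attitude coefficient, 3.3952
```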

Because surf had the highest (worst) p value, we remove it first, and we see that stra is now borderline significant. The adjusted R2 of 0.195 is only a bit better (still only about 19.5% of the variability in points is explained by this model). We could remove stra next, but lm2 is my final model. In this case, there is a good relationship between attitude and points and a small, borderline significant relationship with stra, but only when they are in the model together (we would need to check it independently). Here, a 1 unit increase in attitude results in a 3.47 unit increase in points when stra is held constant, and a 1 unit increase in stra results in a 0.91 unit increase in points when attitude is held constant.

lm2=lm(Points ~ stra + attitude , data = data)
summary(lm2)
## 
## Call:
## lm(formula = Points ~ stra + attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

If we also remove stra (if we only want clearly significant predictors), the adjusted R2 goes down to 0.186, so we are better off keeping it. That would mean only about 19% of the variability in points is explained.

lm3=lm(Points ~ attitude , data = data)
summary(lm3)
## 
## Call:
## lm(formula = Points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Based on the diagnostic plots we can see whether our model is good.
1. Residuals vs Fitted: we expect to see randomly scattered residuals, which means the model captures the overarching trend in the data. If we saw some shape (for example a parabola), we would conclude that the linear model does not capture the data well. It is a good fit if the red line (showing the local mean of the residuals) is close to the horizontal gray dotted line.
2. The QQ plot: the points should lie on the diagonal, which means the residuals are normally distributed and the p values and standard errors are reliable. Small deviations from this line are not a problem.
3. Standardized residuals vs leverage: points should not fall outside a Cook's distance of 1; otherwise they would be high-influence points that could affect the slope, and the estimates of the lm model would not be reliable.

par(mfrow=c(2,2))
plot(lm2)


Chapter 3: Logistic Regression

library(ggplot2)
library(ggpubr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(caret)
## Loading required package: lattice
library(readr)
library(tidyr)
library(dplyr)

Welcome to this week's learning exercise, where we look into cross-validation and logistic regression!

This week's data is from a questionnaire on student performance, with an emphasis on the effect of alcohol. There are 2 parts to this analysis:

Part 1: Data Wrangling

We joined 2 datasets together to form the basis for this analysis. Please see the R script ./data/create_alc.R for more information.

alko=read_csv("./data/alko_data.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(alko)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu      : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason    : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian  : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup    : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities: chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic  : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel    : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences  : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ alko_use  : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   alko_use = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

From this we can see the structure of the data with 370 rows and 35 variables. These variables are printed below.

colnames(alko)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alko_use"   "high_use"

The original source of the data is here: https://www.archive.ics.uci.edu/dataset/320/student+performance. The data was merged from 2 datasets collected in 2 Portuguese secondary schools. The variables include, among others (I will not list them all):

  • student grades (G1, G2, G3).

  • guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’).

  • Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high).

  • Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high).

  • absences - number of school absences (numeric: from 0 to 93).

The two original datasets consist of data from 2 classes, Math (mat) and Portuguese (por), which were merged into one. Originally there were 33 variables, but we construct 2 more:

  • alko_use: the average of the answers related to weekday and weekend alcohol consumption.

  • high_use: logical, TRUE for students for whom ‘alko_use’ is greater than 2 (and FALSE otherwise).
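The two constructed variables can be sketched on a toy data frame (the real construction lives in ./data/create_alc.R):

```r
# Toy illustration of the constructed variables.
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
toy$alko_use <- (toy$Dalc + toy$Walc) / 2  # average weekday/weekend consumption
toy$high_use <- toy$alko_use > 2           # TRUE when the average exceeds 2
toy
```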

#alko1=read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")
#dim(alko1) ## sanity check that the datasets are the same and the data wrangling was successfull

Part 2: Analysis

The purpose now is to study the relationships between high/low alcohol consumption and some of the other variables in the data. We choose 4 interesting variables and look at their relationship with high_use:

  • AGE: I expect more heavy drinkers among older students.

  • GOOUT: Influence and peer pressure heavily associated with drinking in teenagers.

  • ABSENCES: More school absences with high alcohol use (hangover recovery).

  • G3: Higher grades for those that don’t drink.

a=ggplot(alko, aes(x = as.factor(high_use), y = G3))+geom_boxplot() + ylab("grade")+theme_classic()+ggtitle("Grades and High Use")
b=ggplot(alko, aes(x = high_use, y = absences))+geom_boxplot() + ylab("absences")+theme_classic()+ggtitle("Absences and High Use")
c=ggplot(alko, aes(x = high_use, y = age))+geom_boxplot() + ylab("age")+theme_classic()+
    ggtitle("Age and High Use")
d=ggplot(alko, aes(x = high_use, y = goout))+geom_boxplot() + ylab("goout")+theme_classic()+
    ggtitle("Going Out and High Use")
ggarrange(a, b, c, d,labels = c("A", "B", "C", "D"),ncol = 2, nrow = 2)

From these plots we can see that students who are not heavy drinkers have higher grades, fewer absences, are younger, and do not go out as much (sadly). We could also look at cross tabulations, but I do not find them useful as a visual aid; too many numbers.

tabyl(alko, famrel, alko_use, high_use)
## $`FALSE`
##  famrel  1 1.5  2 2.5 3 3.5 4 4.5 5
##       1  2   2  2   0 0   0 0   0 0
##       2  6   1  2   0 0   0 0   0 0
##       3 17  12 10   0 0   0 0   0 0
##       4 70  27 31   0 0   0 0   0 0
##       5 45  21 11   0 0   0 0   0 0
## 
## $`TRUE`
##  famrel 1 1.5 2 2.5  3 3.5 4 4.5 5
##       1 0   0 0   0  0   0 1   0 1
##       2 0   0 0   4  2   2 0   0 1
##       3 0   0 0  12  9   2 1   0 1
##       4 0   0 0  19 14   8 5   2 4
##       5 0   0 0   6  7   5 2   1 2

I will now subset the data so it is easier to work with (so I can use the ‘.’ shorthand in the model formula).

alko_use_select=select(alko, absences, G3, age, goout, high_use)

We model with a logistic regression (because we have a binary outcome). We can see that absences and goout are the significant variables.

glm.1=glm(high_use ~., data = alko_use_select, family = "binomial")
summary(glm.1)
## 
## Call:
## glm(formula = high_use ~ ., family = "binomial", data = alko_use_select)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.77834    1.96186  -1.926  0.05412 .  
## absences     0.07315    0.02237   3.270  0.00108 ** 
## G3          -0.04472    0.03923  -1.140  0.25435    
## age          0.04563    0.11022   0.414  0.67890    
## goout        0.70668    0.11917   5.930 3.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 388.67  on 365  degrees of freedom
## AIC: 398.67
## 
## Number of Fisher Scoring iterations: 4

The model coefficients

coef(glm.1)
## (Intercept)    absences          G3         age       goout 
## -3.77834234  0.07314823 -0.04471930  0.04562870  0.70667982

Calculate the odds ratios and their confidence intervals.

  • For a one-unit increase in ‘absences’, the odds of high alcohol consumption are multiplied by about 1.076. This is in line with my hypothesis that high alcohol consumption and absences are related.

  • For a one-unit increase in ‘goout’, the odds are multiplied by about 2.03. Once again, this is consistent with my hypothesis that going out with friends more often is associated with more alcohol consumption.

  • For a one-unit increase in ‘G3’, the odds are multiplied by about 0.956, i.e. slightly decreased. Consistent with our hypothesis, but weak.

  • Lastly, for a one-unit increase in ‘age’, the odds are multiplied by about 1.047, although the confidence interval includes 1. Directionally consistent with our hypothesis.

LOR=coef(glm.1) %>% exp
CI=confint(glm.1) %>% exp
## Waiting for profiling to be done...
cbind(LOR, CI)
##                    LOR        2.5 %   97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## absences    1.07589001 1.0312932049 1.127070
## G3          0.95626587 0.8852281993 1.032878
## age         1.04668570 0.8430071337 1.299954
## goout       2.02724925 1.6139414455 2.577757
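Note that these values are simply the exponentiated model coefficients, i.e. odds ratios rather than log-odds. For example, for ‘absences’:

```r
# exp() of the 'absences' coefficient (copied from coef(glm.1) above)
# reproduces its odds ratio in the table.
exp(0.07314823)  # ~1.0759
```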

To evaluate the model, we can look at the deviance.

with(glm.1, null.deviance - deviance)
## [1] 63.3708
with(glm.1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 5.669695e-13

We have a significant chi-squared statistic, so this model is better than a null model.

We can look at McFadden's pseudo-R2, which is 0.14 here, meaning only a small part of the deviance is explained by this model, which is not amazing. Note, however, that pseudo-R2 values are not directly comparable to the R2 of linear regression.

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(glm.1)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -194.3343924 -226.0197918   63.3707987    0.1401886    0.1574080    0.2231852
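The McFadden value above can be reproduced by hand from the reported log-likelihoods (1 - llh / llhNull), and G2 is twice their difference:

```r
llh     <- -194.3343924   # model log-likelihood (from the pR2 output)
llhNull <- -226.0197918   # null-model log-likelihood
1 - llh / llhNull         # McFadden's pseudo-R2, ~0.140
2 * (llh - llhNull)       # G2 statistic, ~63.37
```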

Refitting with only the significant variables gives an accuracy of 0.76.

alko_use_select2=select(alko_use_select, high_use, absences, goout)
glm2=glm(high_use ~ absences + goout, data = alko_use_select2, family = "binomial")
summary(glm2)
## 
## Call:
## glm(formula = high_use ~ absences + goout, family = "binomial", 
##     data = alko_use_select2)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.63280    0.43792  -8.296  < 2e-16 ***
## absences     0.07632    0.02225   3.430 0.000602 ***
## goout        0.73380    0.11786   6.226 4.78e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 390.30  on 367  degrees of freedom
## AIC: 396.3
## 
## Number of Fisher Scoring iterations: 4
alko_use_select2$predicted_values=predict(glm2, newdata = alko_use_select2, type = "response")
alko_use_select2$predicted_values=alko_use_select2$predicted_values > 0.5
cm=confusionMatrix(as.factor(alko_use_select2$predicted_values), as.factor(alko_use_select2$high_use), mode = "everything")

plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels=rev(levels(plt$Prediction)))

ggplot(plt, aes(Prediction,Reference, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194")

## (46+236)/370=0.76

Random guessing gives an accuracy of about 0.49, which is worse.

set.seed(1)
high_use_random=ifelse(rbinom(370, 1, 0.5)==1,"TRUE","FALSE")
table(high_use_random, alko_use_select2$high_use)
##                
## high_use_random FALSE TRUE
##           FALSE   134   65
##           TRUE    125   46
##(134+46)/370
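For reference, the accuracy of the random guess can be computed directly from the counts in the table above:

```r
# Counts copied from the cross tabulation of random guesses vs. high_use.
correct <- 134 + 46            # guessed FALSE & was FALSE, guessed TRUE & was TRUE
total   <- 134 + 65 + 125 + 46 # all 370 students
correct / total                # ~0.486
```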

Then we can start generalizing with a 10-fold cross-validation.

train_=trainControl(method = "cv", number = 10)

cv =train(factor(high_use) ~.,
               data = alko_use_select2,
               trControl = train_,
               method = "glm",
               family=binomial())
cv
## Generalized Linear Model 
## 
## 370 samples
##   3 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 334, 333, 332, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7622768  0.3596589

An accuracy of 0.762 corresponds to an error of 0.238, which is lower than the error given in the exercise set (0.26), but we can try other models just for fun! For example, the model below, using the exercise set's variables, reaches an accuracy of about 0.746 here (an error of about 0.254; cross-validation results vary a bit with the random folds), close to what the exercise set gives us.

cv1 =train(factor(high_use) ~ age + famrel + goout + health, data = alko,
               trControl = train_,
               method = "glm",
               family=binomial())
cv1
## Generalized Linear Model 
## 
## 370 samples
##   4 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 333, 333, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7459459  0.3132404

Let's try more models: we start with a full model. This consists of 34 predictors, including alko_use (from which high_use was derived), so we know from the start that the results will be misleadingly optimistic.

 cvfull=train(factor(high_use) ~ .,
               data = alko,
               trControl = train_,
               method = "glm",
               family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvfull
## Generalized Linear Model 
## 
## 370 samples
##  34 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 334, 333, 333, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1

And that's exactly what we get. The output says glm did not converge, so the results are misleading: an accuracy of 100% is suspicious and I would not trust it. We can try another model with half the predictors (17 randomly chosen).

index=sample(34,17,replace=FALSE)
cvhalf=train(factor(high_use) ~ .,
               data = alko[,c(index,35)],
               trControl = train_,
               method = "glm",
               family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvhalf
## Generalized Linear Model 
## 
## 370 samples
##  17 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 334, 333, 333, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1

Again we get the same results. We can look into stepwise selection to see which variables come up as most interesting (feature selection). It turns out only 2 variables are important, Dalc and Walc, so we can try those out. This is interesting, as high_use is based on alko_use, which was calculated from these very variables:

alko$alko_use = (alko$Dalc + alko$Walc)/2
alko$high_use = ifelse(alko$alko_use > 2, TRUE, FALSE)

But when we use these two, we again get a perfect fit. This is not so much overfitting as data leakage: the outcome is a deterministic function of these two predictors.
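This can be verified directly. A minimal self-contained check, with simulated Dalc/Walc values on the same 1-5 scale (the alko data itself is not reproduced here):

```r
# Simulated weekday/weekend alcohol-use scores on the same 1-5 scale
set.seed(1)
Dalc <- sample(1:5, 100, replace = TRUE)
Walc <- sample(1:5, 100, replace = TRUE)
alc_use  <- (Dalc + Walc) / 2   # same construction as in the data wrangling
high_use <- alc_use > 2
# the label is fully determined by the two predictors, so any model
# using both can reach 100% accuracy -- leakage, not skill
all(high_use == ((Dalc + Walc) / 2 > 2))  # TRUE
```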

cvstwo=train(factor(high_use) ~ Walc+Dalc,
               data = alko,
               trControl = train_,
               method = "glm",
               family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvstwo
## Generalized Linear Model 
## 
## 370 samples
##   2 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 334, 333, 332, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1
cv_logistic = function(num_predictors) {
    if (num_predictors > 34) {
        return(NA)  # guard clause: the data has at most 34 predictors
                    # ('break' is only valid inside loops, not here)
    }
    predictors = names(alko)[1:num_predictors]
    form = as.formula(paste("as.factor(high_use) ~", paste(predictors, collapse = " + ")))
  
    model = train(form,
                  data = alko,
                  trControl = train_,
                  method = "glm",
                  family = binomial())
  
    return(model$results$Accuracy)  # cross-validated accuracy
}

num_predictors_range =seq(ncol(alko) - 1, 1, by = -1)
cv_results=sapply(num_predictors_range, cv_logistic)

plot(num_predictors_range,1-cv_results, type = 'b', col = 'blue', pch = 16, xlab = 'Number of Predictors', ylab = 'Error', main = 'Cross-Validation Performance')


Chapter 4 Linear Discriminant Analysis (LDA)

set.seed(199)
library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)
library(ggord)
library(cluster) 
library(factoextra)

To get started, we load the Boston dataset from the MASS package and get a glimpse of the data. This dataset consists of information on 506 suburbs of Boston, described by 14 variables:

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

We can look at the pairwise relationships between variables with ggpairs. rm seems to be the only variable that is approximately normally distributed.

GGally::ggpairs(Boston)

The correlation matrix can also be used to analyze relationships between variables. For example, there is a negative correlation between nox and dis, and a positive correlation between tax and nox.

cor_matrix=cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

We can now scale the data.

boston_scale=data.frame(scale(Boston))
summary(boston_scale)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scale[1:5,1:5]
##         crim         zn      indus       chas        nox
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581
Boston[1:5,1:5]
##      crim zn indus chas   nox
## 1 0.00632 18  2.31    0 0.538
## 2 0.02731  0  7.07    0 0.469
## 3 0.02729  0  7.07    0 0.469
## 4 0.03237  0  2.18    0 0.458
## 5 0.06905  0  2.18    0 0.458

We will also create a new categorical variable based on the quantiles of the scaled crim variable, splitting it into 4 categories: low, medium_low, medium_high and high.

breaks=quantile(boston_scale$crim)
breaks
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime_scaled_cat =cut(boston_scale$crim, breaks = breaks, include.lowest = TRUE, labels = c("low", "medium_low", "medium_high", "high"))
table(crime_scaled_cat)
## crime_scaled_cat
##         low  medium_low medium_high        high 
##         127         126         126         127
temp=data.frame(boston_scale$crim,crime_scaled_cat)
temp[sample(1:length(boston_scale$crim),10,replace=FALSE),]
##     boston_scale.crim crime_scaled_cat
## 307       -0.41137883              low
## 482        0.24352095             high
## 491       -0.39598276       medium_low
## 39        -0.39975069       medium_low
## 362        0.02596236             high
## 67        -0.41501074              low
## 125       -0.40865141       medium_low
## 223       -0.34760773      medium_high
## 481        0.25698714             high
## 489       -0.40256297       medium_low
boston_scale$crim=NULL
boston_scale$crime_scaled_cat=crime_scaled_cat

Now, we want to divide the dataset into training and test sets. The training set is composed of 80% of the data.

training_set=sample(nrow(boston_scale), nrow(boston_scale)*.8)
train=boston_scale[training_set,]
test=boston_scale[-training_set,]

The linear discriminant analysis can now be fit on the training data; LD1 explains about 95% of the between-group variance (proportion of trace). LDA is a classification and dimensionality reduction method that finds linear combinations of variables that separate the classes of the target variable, which can be binary or multi-class.

lda_training=lda(crime_scaled_cat~., train)
lda_training
## Call:
## lda(crime_scaled_cat ~ ., data = train)
## 
## Prior probabilities of groups:
##         low  medium_low medium_high        high 
##   0.2500000   0.2425743   0.2450495   0.2623762 
## 
## Group means:
##                      zn      indus         chas        nox         rm
## low          0.86487545 -0.8990743 -0.155385496 -0.8389011  0.4558336
## medium_low  -0.07575093 -0.2774956  0.008892378 -0.5556718 -0.1492985
## medium_high -0.38156320  0.1692037  0.165126514  0.3212364  0.2051377
## high        -0.48724019  1.0170298 -0.123759247  1.0511511 -0.4286567
##                    age        dis        rad        tax     ptratio      black
## low         -0.8334632  0.7985302 -0.6805407 -0.7420261 -0.39791957  0.3753579
## medium_low  -0.3282985  0.3297812 -0.5470944 -0.4671799 -0.06948515  0.3152093
## medium_high  0.3603443 -0.3478542 -0.3739957 -0.2948413 -0.28973051  0.1402815
## high         0.8117090 -0.8738603  1.6390172  1.5146914  0.78181164 -0.8900971
##                   lstat         medv
## low         -0.74684432  0.516936891
## medium_low  -0.12574618  0.009746835
## medium_high -0.07154987  0.292419861
## high         0.96578752 -0.742109591
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.08674173  0.655178733 -0.80015979
## indus    0.03744995 -0.313652772  0.31995982
## chas    -0.09669696 -0.064330933  0.13106443
## nox      0.41352402 -0.782548101 -1.42602928
## rm      -0.06165354 -0.095015482 -0.24679643
## age      0.27354470 -0.290313011 -0.03121271
## dis     -0.03808615 -0.294222502  0.09415368
## rad      3.01185937  0.916333257 -0.22597380
## tax     -0.02962922 -0.051343716  0.64208917
## ptratio  0.09954469  0.025313275 -0.26546086
## black   -0.14904020  0.006966262  0.09945314
## lstat    0.23359305 -0.172953148  0.36480426
## medv     0.16687433 -0.440038019 -0.20341341
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9545 0.0331 0.0124

We can look at the biplot, where rad appears to be the most important variable in discriminating between classes, i.e., it has the strongest influence on the crime categories. rad and crim are also highly correlated in the correlation plot above.

ggord(lda_training, train$crime_scaled_cat)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime_scaled_cat)

# plot the lda results
plot(lda_training, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_training, myscale = 1)

Then, remove the outcome variable from the test set and keep it aside for checking the predictions.

crime_test=test$crime_scaled_cat
test$crime_scaled_cat=NULL

This model appears to be good at discriminating the high class.

test_pred=predict(lda_training,test)
table(correct = crime_test, predicted = test_pred$class)
##              predicted
## correct       low medium_low medium_high high
##   low          21          5           0    0
##   medium_low    6         21           1    0
##   medium_high   1          9          17    0
##   high          0          0           0   21

This model has an accuracy of 78%!

library(caret)
con_table=confusionMatrix(data=test_pred$class, reference = crime_test)
con_table
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    low medium_low medium_high high
##   low          21          6           1    0
##   medium_low    5         21           9    0
##   medium_high   0          1          17    0
##   high          0          0           0   21
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7843          
##                  95% CI : (0.6919, 0.8596)
##     No Information Rate : 0.2745          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7112          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: medium_low Class: medium_high
## Sensitivity              0.8077            0.7500             0.6296
## Specificity              0.9079            0.8108             0.9867
## Pos Pred Value           0.7500            0.6000             0.9444
## Neg Pred Value           0.9324            0.8955             0.8810
## Prevalence               0.2549            0.2745             0.2647
## Detection Rate           0.2059            0.2059             0.1667
## Detection Prevalence     0.2745            0.3431             0.1765
## Balanced Accuracy        0.8578            0.7804             0.8081
##                      Class: high
## Sensitivity               1.0000
## Specificity               1.0000
## Pos Pred Value            1.0000
## Neg Pred Value            1.0000
## Prevalence                0.2059
## Detection Rate            0.2059
## Detection Prevalence      0.2059
## Balanced Accuracy         1.0000

Now we can look at k-means. The data is re-loaded and re-scaled, and the distance between every pair of observations is calculated using Euclidean distance (the default). Note that below the distance matrix, rather than the scaled data itself, is passed to kmeans(); this runs, but k-means is more conventionally applied directly to the feature matrix.

data(Boston)
boston_scale=scale(Boston)
boston_scale=as.data.frame(boston_scale)
dist_boston_eucl=dist(boston_scale)
summary(dist_boston_eucl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

A preliminary run is done first just to get a feel for k-means. Then we calculate k-means with different values of k to make the scree plot. The total within-cluster sum of squares (TWCSS) drops sharply between k = 1 and k = 2, so I am thinking 2 clusters, but I will also look at the average silhouette width to check which number of clusters to take.

km_euc = kmeans(dist_boston_eucl, centers = 4)

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_boston_eucl, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

silhouette_score <- function(k){
  km <- kmeans(dist_boston_eucl, centers = k)
  ss <- silhouette(km$cluster, dist_boston_eucl)
  mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, type='b', avg_sil, xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)

Using 2 clusters, recalculate k-means.

km_euc_2 = kmeans(dist_boston_eucl, centers = 2)
boston_scale$km_euc_2=km_euc_2$cluster
boston_scale[1:5,]
##         crim         zn      indus       chas        nox        rm        age
## 1 -0.4193669  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
##        dis        rad        tax    ptratio     black      lstat       medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 1.076671 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 1.076671 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
##   km_euc_2
## 1        1
## 2        1
## 3        1
## 4        1
## 5        1
pairs(boston_scale[1:5], col = boston_scale$km_euc_2)

Redo the same on the whole original dataset, this time with 3 clusters, and refit the LDA model using the k-means clusters as the target variable. In this model LD1 discriminates about 91% of the between-group variance.

set.seed(123)
data("Boston")
boston_scale=scale(Boston)
boston_scale=as.data.frame(boston_scale)
km=kmeans(boston_scale, centers = 3)

boston_b =boston_scale %>% 
  mutate(crim = as.factor(km$cluster))

training_index=sample(nrow(boston_b),size=nrow(boston_b)*0.8)
train1=boston_b[training_index,]
test1=boston_b[-training_index,]
lda.fit4=lda(crim ~ ., data = train1)
lda.fit4
## Call:
## lda(crim ~ ., data = train1)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2648515 0.4108911 0.3242574 
## 
## Group means:
##           zn       indus        chas         nox         rm        age
## 1 -0.4872402  1.04730816  0.02203357  1.04361007 -0.3235550  0.7718928
## 2 -0.4098805 -0.01560851  0.17830289  0.02207459 -0.2182658  0.3629085
## 3  0.9031573 -0.95171033 -0.12205807 -0.94944506  0.5913865 -1.0936979
##          dis        rad        tax    ptratio      black       lstat
## 1 -0.8347830  1.5554895  1.4906517  0.7336867 -0.6683009  0.77463317
## 2 -0.2418567 -0.5695300 -0.4734350 -0.1160846  0.2441335  0.07308108
## 3  1.0775158 -0.5970032 -0.6727432 -0.4522967  0.3488940 -0.85317921
##          medv
## 1 -0.61489314
## 2 -0.08812743
## 3  0.63096726
## 
## Coefficients of linear discriminants:
##                 LD1          LD2
## zn      -0.11135414  0.404232974
## indus   -0.24097333 -0.169503286
## chas    -0.03448550 -0.103442454
## nox     -0.25398584  0.123736788
## rm       0.09113006  0.416221698
## age     -0.10619045 -0.917270414
## dis      0.18493191  0.493827453
## rad     -2.60905423  0.903652155
## tax     -0.88492911  0.268708395
## ptratio -0.10172379 -0.001364374
## black    0.16721983 -0.103047554
## lstat   -0.06586241  0.051311795
## medv     0.09222017  0.216163863
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9061 0.0939

rad is the most influential variable for separating and discriminating the classes, and it mainly influences class 1. age is the next most influential and is better able to separate classes 2 and 3.

ggord(lda.fit4, train1$crim)

Look at the 3D biplot

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
crime_train=train$crime_scaled_cat
model_predictors <- dplyr::select(train, -crime_scaled_cat)

km_euc_2 = kmeans(dist(model_predictors), centers = 2)

matrix_product <- as.matrix(model_predictors) %*% lda_training$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color =crime_train )

I am very sick, so I will do the bare minimum.

library(readr)
human = read_csv("./data/human_data.csv")
## New names:
## Rows: 155 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (9): ...1, edu2Ratio, labRatio, Edu.Exp, Life.Exp, GNI,
## Mat.Mor, Ado.Bir...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(human)
##       ...1         Country            edu2Ratio         labRatio     
##  Min.   :  1.0   Length:155         Min.   :0.1717   Min.   :0.1857  
##  1st Qu.: 39.5   Class :character   1st Qu.:0.7264   1st Qu.:0.5984  
##  Median : 78.0   Mode  :character   Median :0.9375   Median :0.7535  
##  Mean   : 78.0                      Mean   :0.8529   Mean   :0.7074  
##  3rd Qu.:116.5                      3rd Qu.:0.9968   3rd Qu.:0.8535  
##  Max.   :155.0                      Max.   :1.4967   Max.   :1.0380  
##     Edu.Exp         Life.Exp          GNI            Mat.Mor      
##  Min.   : 5.40   Min.   :49.00   Min.   :   581   Min.   :   1.0  
##  1st Qu.:11.25   1st Qu.:66.30   1st Qu.:  4198   1st Qu.:  11.5  
##  Median :13.50   Median :74.20   Median : 12040   Median :  49.0  
##  Mean   :13.18   Mean   :71.65   Mean   : 17628   Mean   : 149.1  
##  3rd Qu.:15.20   3rd Qu.:77.25   3rd Qu.: 24512   3rd Qu.: 190.0  
##  Max.   :20.20   Max.   :83.50   Max.   :123124   Max.   :1100.0  
##    Ado.Birth         Parli.F     
##  Min.   :  0.60   Min.   : 0.00  
##  1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 33.60   Median :19.30  
##  Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :204.80   Max.   :57.50
rownames(human)=human$Country
## Warning: Setting row names on a tibble is deprecated.
human=human[,3:10]
head(human)
## # A tibble: 6 × 8
##   edu2Ratio labRatio Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
##       <dbl>    <dbl>   <dbl>    <dbl> <dbl>   <dbl>     <dbl>   <dbl>
## 1     1.01     0.891    17.5     81.6 64992       4       7.8    39.6
## 2     0.997    0.819    20.2     82.4 42261       6      12.1    30.5
## 3     0.983    0.825    15.8     83   56431       6       1.9    28.5
## 4     0.989    0.884    18.7     80.2 44025       5       5.1    38  
## 5     0.969    0.829    17.9     81.6 45435       6       6.2    36.9
## 6     0.993    0.807    16.5     80.9 43919       7       3.8    36.9
colnames(human)
## [1] "edu2Ratio" "labRatio"  "Edu.Exp"   "Life.Exp"  "GNI"       "Mat.Mor"  
## [7] "Ado.Birth" "Parli.F"

Most variables are not normally distributed, except maybe Edu.Exp.

par(mfrow = c(2,4))
hist(human$edu2Ratio, main="edu2Ratio", xlab="")
hist(human$labRatio, main="labRatio", xlab="")
hist(human$Edu.Exp, main="Edu.Exp", xlab="")
hist(human$Life.Exp, main="Life.Exp", xlab="")
hist(human$GNI, main="GNI", xlab="")
hist(human$Mat.Mor, main="Mat.Mor", xlab="")
hist(human$Ado.Birth, main="Ado.Birth", xlab="")
hist(human$Parli.F, main="Parli.F", xlab="")

summary(human)
##    edu2Ratio         labRatio         Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Correlation plot: we have some strong correlations, for example between Edu.Exp and Life.Exp.

library(corrplot)
cor_matrix=cor(human) %>% round(digits = 2)
cor_matrix
##           edu2Ratio labRatio Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## edu2Ratio      1.00     0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## labRatio       0.01     1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp        0.59     0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp       0.58    -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI            0.43    -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor       -0.66     0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth     -0.53     0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F        0.08     0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

Now we run PCA on the non-standardized dataset.

pca_human_raw= prcomp(human)
sum1=summary(pca_human_raw)
pca_imp_raw = round(100*sum1$importance[2,], digits = 1)
pca_imp_raw
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
biplot(pca_human_raw)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now we run PCA on the standardized data.

human_std=scale(human)
pca_human_std =prcomp(human_std)
sum2 <- summary(pca_human_std)
sum2
## Importance of components:
##                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.071 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.536 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.536 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

This seems more reasonable; on the non-standardized data, PC1 explained fully 100% of the variance.

pca_pr2_std = round(100*sum2$importance[2,], digits = 1)
pca_pr2_std
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
biplot(pca_human_std,cex = c(0.5, 0.5),)

The results of a PCA on standardized vs. non-standardized data are very different. On the raw data, essentially 100% of the importance falls on PC1, driven by GNI, whose scale dwarfs the other variables. On the standardized data, the first component loads positively on the education variables and negatively on maternal mortality and adolescent birth rate. The first PC explains 54% and the second 16% of the variation in the data.
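A small self-contained demonstration of this scaling effect (toy data, not the human dataset): when one column has a much larger variance, it dominates unscaled PCA.

```r
# Toy data: two unit-variance columns plus one column with sd = 1000
set.seed(7)
x <- cbind(matrix(rnorm(100 * 2), ncol = 2), rnorm(100, sd = 1000))
raw_share <- summary(prcomp(x))$importance[2, 1]                 # PC1 share, unscaled
std_share <- summary(prcomp(x, scale. = TRUE))$importance[2, 1]  # PC1 share, scaled
round(c(raw = raw_share, std = std_share), 2)
# the unscaled PC1 captures essentially all of the variance (it just
# tracks the large-scale column), while after standardization no
# single component dominates
```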

Multiple Correspondence Analysis (MCA)

library(tidyr)
library(factoextra)
library(ggplot2)
library(dplyr)
library(FactoMineR)
tea = read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

tea_data= tea[, c(4,13:17)]
summary(tea_data)
##        lunch            Tea         How           sugar    
##  lunch    : 44   black    : 74   alone:195   No.sugar:155  
##  Not.lunch:256   Earl Grey:193   lemon: 33   sugar   :145  
##                  green    : 33   milk : 63                 
##                                  other:  9                 
##                  how                       where    
##  tea bag           :170   chain store         :192  
##  tea bag+unpackaged: 94   chain store+tea shop: 78  
##  unpackaged        : 36   tea shop            : 30  
## 
View(tea_data)
gather(tea_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will be
## dropped

mca_tea = MCA(tea_data, graph = FALSE)
mca_tea
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$var$eta2"       "coord. of variables"             
## 8  "$ind"            "results for the individuals"     
## 9  "$ind$coord"      "coord. for the individuals"      
## 10 "$ind$cos2"       "cos2 for the individuals"        
## 11 "$ind$contrib"    "contributions of the individuals"
## 12 "$call"           "intermediate results"            
## 13 "$call$marge.col" "weights of columns"              
## 14 "$call$marge.li"  "weights of rows"
summary(mca_tea)
## 
## Call:
## MCA(X = tea_data, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327  0.163  0.104
## 2         | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695  0.735  0.314
## 3         | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202  0.062  0.069
## 4         | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211  0.068  0.073
## 5         | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202  0.062  0.069
## 6         | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202  0.062  0.069
## 7         | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202  0.062  0.069
## 8         | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695  0.735  0.314
## 9         |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067  0.007  0.003
## 10        |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650  0.643  0.261
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## lunch     |   0.007   0.000   0.000   0.048 |   0.608   3.468   0.064   4.362 |
## Not.lunch |  -0.001   0.000   0.000  -0.048 |  -0.105   0.596   0.064  -4.362 |
## black     |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003   0.929 |
## Earl Grey |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027   2.867 |
## green     |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107  -5.669 |
## alone     |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127  -6.164 |
## lemon     |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035   3.226 |
## milk      |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020   2.422 |
## other     |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102   5.534 |
## No.sugar  |   0.247   1.877   0.065   4.412 |   0.026   0.023   0.001   0.473 |
##             Dim.3     ctr    cos2  v.test  
## lunch       0.804   7.204   0.111   5.764 |
## Not.lunch  -0.138   1.238   0.111  -5.764 |
## black      -1.081  21.888   0.382 -10.692 |
## Earl Grey   0.433   9.160   0.338  10.053 |
## green      -0.108   0.098   0.001  -0.659 |
## alone      -0.113   0.627   0.024  -2.655 |
## lemon       1.329  14.771   0.218   8.081 |
## milk        0.013   0.003   0.000   0.116 |
## other      -2.524  14.526   0.197  -7.676 |
## No.sugar   -0.561  12.347   0.336 -10.026 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## lunch     | 0.000 0.064 0.111 |
## Tea       | 0.126 0.108 0.410 |
## How       | 0.076 0.190 0.394 |
## sugar     | 0.065 0.001 0.336 |
## how       | 0.708 0.522 0.010 |
## where     | 0.702 0.681 0.055 |
fviz_screeplot(mca_tea, addlabels = TRUE)

From the scree plot we can see that the first 4 dimensions explain about 52% of the variability. In the biplot, distance corresponds to similarity: closer points are more similar.

plot(mca_tea,invisible=c("ind"))
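As a complementary view (a sketch only, assuming the mca_tea object fitted above), factoextra also provides MCA plotting helpers; the category map is often easier to read than the full biplot:

```r
# Hedged sketch: map of the variable categories from the MCA above;
# repel = TRUE moves labels apart to avoid overplotting
fviz_mca_var(mca_tea, repel = TRUE)
```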